fix: enable multi-GPU DDP training in Jupyter notebooks#928

Merged
Borda merged 35 commits into roboflow:develop from mfazrinizar:fix/ddp-notebook-cuda-init
Apr 8, 2026

Conversation

@mfazrinizar
Contributor

What does this PR do?

Fixes multi-GPU DDP training (strategy="ddp_notebook" and strategy="ddp_spawn") which was completely broken in Jupyter (e.g. Kaggle) notebook environments. The fix addresses two layers of issues:

  1. CUDA early initialization: RFDETRBase() eagerly moved the model to CUDA during __init__(), and module-level torch.cuda.is_available() in config.py created a CUDA driver context at import time, making multi-process training impossible.

  2. OpenMP thread pool corruption after fork: Even after fixing CUDA init, PyTorch's OpenMP thread pool (created during model construction) cannot survive fork(). The worker threads become zombie handles, causing SIGABRT: Invalid thread pool! when the autograd engine initializes in forked children. Fixed by transparently replacing fork-based DDP with a spawn-based strategy.

Related Issue(s): Fixes #923

Type of Change

  • Bug fix (non-breaking change that fixes an issue)

Testing

  • I have tested this change locally
  • I have added/updated tests for this change

Test details:

Unit tests (101 pass locally)

  • test_build_trainer.py: 52 tests covering precision resolution, strategy selection, ddp_notebook→spawn mapping, EMA guards, logger wiring
  • test_module_data.py: 49 tests including test_ddp_notebook_preserves_num_workers and test_other_strategy_preserves_num_workers

Integration test (Kaggle T4 x2)

Validated on Kaggle GPU T4 x2 accelerator (Python 3.12, PyTorch 2.10.0+cu128, PTL 2.6.1):

Test                                                 Result    Time
CUDA not initialized after RFDETRBase()              ✅ PASS
Model weights on CPU after construction              ✅ PASS
strategy="ddp_notebook" training (3 epochs, 2×T4)    ✅ PASS   84.3s
strategy="ddp_spawn" training (3 epochs, 2×T4)       ✅ PASS   77.4s
Inference after DDP training                         ✅ PASS

What This Fixes

Scenario                                                      Before                                        After
model.train(devices=2, strategy="ddp_notebook") in notebook   ❌ CUDA re-init / SIGABRT                     ✅ Works
model.train(devices=2, strategy="ddp_spawn") in notebook      ❌ CUDA re-init / MisconfigurationException   ✅ Works
model.train(devices=1)                                        ✅ Works                                      ✅ Works (no regression)
model.predict(img)                                            ✅ Works                                      ✅ Works (lazy device placement)
model.train() → model.predict(img)                            ✅ Works                                      ✅ Works
model.export_onnx() / model.optimize_for_inference()          ✅ Works                                      ✅ Works

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code where necessary, particularly in hard-to-understand areas
  • My changes generate no new warnings or errors
  • I have updated the documentation accordingly (if applicable)

Additional Context

The ddp_notebook → spawn conversion is transparent to users: they continue passing strategy="ddp_notebook" (or strategy="ddp_spawn") and training just works. An INFO log message is emitted:

[INFO] rf-detr - ddp_notebook → spawn-based DDP to avoid OpenMP thread pool corruption after fork.

The find_unused_parameters=True flag is required because RF-DETR's architecture has parameters in the detection head that may not contribute to every loss term (e.g. encoder-only auxiliary losses).
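A minimal sketch of the DDP-related kwargs this implies (the helper name and exact shape are assumptions for illustration; the real PR passes these through Lightning's DDPStrategy, and find_unused_parameters is a standard torch DDP option):

```python
# Hypothetical helper, not the actual rf-detr API: collects the two DDP
# settings this PR's description calls out.

def notebook_ddp_kwargs() -> dict:
    return {
        "start_method": "spawn",         # fork is unsafe once the OMP pool exists
        "find_unused_parameters": True,  # some head params may skip a loss term
    }
```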

Technical Details

Two layers of CUDA initialization that had to be fixed

  1. Module-level (config.py): torch.cuda.is_available() creates a CUDA driver context at import time. Fixed with torch.accelerator.current_accelerator() which queries NVML without creating a primary context.

  2. Model construction (inference.py): nn_model.to("cuda") fully initializes the CUDA runtime. Fixed by keeping the model on CPU and deferring .to(device) to first predict()/export()/batch_size="auto" call via _ensure_model_on_device().

Why spawn instead of fork

PyTorch creates an OpenMP thread pool (default 8 threads) during the first tensor operation (here, model construction). fork() copies only the calling thread, so the OMP worker threads become zombie handles in the child. When the autograd engine in a forked child calls set_num_threads during thread_init, the OMP runtime finds an invalid pool state and aborts:

terminate called after throwing an instance of 'c10::Error'
  what(): pool INTERNAL ASSERT FAILED at "/pytorch/aten/src/ATen/ParallelOpenMP.cpp":64

This is a fundamental fork+OMP incompatibility; as far as I know, there is no library-level workaround. The fix transparently replaces fork-based ddp_notebook with a spawn-based _NotebookSpawnDDPStrategy whose launcher is marked is_interactive_compatible = True, allowing PTL to accept it in notebook environments.

Performance impact

  • First predict() call: ~50-200ms one-time latency from the CPU→GPU model transfer. The cost is strictly one-time: _ensure_model_on_device() checks first_param.device != target and becomes a no-op once the model is on GPU. After train(), the PTL-trained model is already on CUDA (synced at line 548), so even the first post-training predict() has zero transfer cost.
  • Subsequent predict() calls: Zero overhead (single next(parameters()).device comparison)
  • Production inference (RFDETRBase() → predict() without training): The one-time transfer happens on the very first call only. All subsequent calls, including batch evaluation loops, are zero-overhead.
  • Training: Zero impact (PTL builds its own model on CPU and handles device placement)
  • DDP spawn vs fork: ~12s additional startup for process spawn (one-time per training run)

@codecov

codecov bot commented Apr 7, 2026

Codecov Report

❌ Patch coverage is 89.83051% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 79%. Comparing base (3f3bab3) to head (b4e82e4).
⚠️ Report is 1 commit behind head on develop.

Additional details and impacted files
@@          Coverage Diff           @@
##           develop   #928   +/-   ##
======================================
  Coverage       79%    79%           
======================================
  Files           97     97           
  Lines         7793   7846   +53     
======================================
+ Hits          6148   6195   +47     
- Misses        1645   1651    +6     

@Borda Borda added the bug Something isn't working label Apr 8, 2026
@Borda Borda requested a review from Copilot April 8, 2026 16:41
Contributor

Copilot AI left a comment

Pull request overview

Fixes multi-GPU DDP training in interactive notebook environments by preventing early CUDA initialization and by transparently switching notebook DDP strategies away from fork to a spawn-based launcher/strategy.

Changes:

  • Add a notebook-safe spawn-based DDPStrategy replacement for ddp_notebook / ddp_spawn in the trainer factory.
  • Defer inference-model .to(device) until first use via a new lazy device-placement helper.
  • Replace direct torch.cuda.is_available() checks with a device constant intended to avoid CUDA context creation at import time, and update tests accordingly.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.

File                                    Description
src/rfdetr/config.py                    Introduces _detect_device() / DEVICE to avoid CUDA runtime init at import time.
src/rfdetr/inference.py                 Stops eager model .to(device) during model-context construction to prevent early CUDA init.
src/rfdetr/detr.py                      Adds _ensure_model_on_device() and calls it from inference/export/optimize/auto-batch paths.
src/rfdetr/training/trainer.py          Maps ddp_notebook/ddp_spawn to a spawn-based, interactive-compatible DDP strategy.
src/rfdetr/training/module_model.py     Uses config.DEVICE for compile gating instead of torch.cuda.is_available().
src/rfdetr/training/module_data.py      Uses config.DEVICE for pin_memory decisions; preserves configured num_workers.
tests/training/test_build_trainer.py    Adds coverage for spawn-based DDP mapping and ddp_notebook precision probing.
tests/training/test_module_data.py      Adds tests asserting num_workers/prefetch_factor preservation for strategies.
tests/training/test_module_model.py     Updates compile test to patch config.DEVICE instead of torch.cuda.is_available().

Borda and others added 6 commits April 8, 2026 19:09
- Adds `hasattr(torch, "accelerator")` outer guard in `_detect_device()`
  so PyTorch < 2.4 (where `torch.accelerator` module does not exist)
  does not raise AttributeError at import time

---
Co-authored-by: Claude Code <noreply@anthropic.com>
- Assertions are stripped with `python -O`; use explicit if+raise for
  required runtime guards

---
Co-authored-by: Claude Code <noreply@anthropic.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
- _MultiProcessingLauncher has no public equivalent in PTL 2.x; adds a
  comment to monitor for breakage when bumping the PTL lower bound

---
Co-authored-by: Claude Code <noreply@anthropic.com>
Borda and others added 7 commits April 8, 2026 19:13
- Old docstring said "moves the model to the target device" and "ready
  for inference", both no longer true; model is kept on CPU and moved
  lazily by _ensure_model_on_device on first use

---
Co-authored-by: Claude Code <noreply@anthropic.com>
- Satisfies static analysis requirements; function accepts duck-typed
  stand-ins, which Any correctly reflects

---
Co-authored-by: Claude Code <noreply@anthropic.com>
…n DDP

- torch.cuda.is_available() + is_bf16_supported() initialize CUDA in the
  parent; add a comment documenting this is intentional because all DDP
  paths use spawn, not fork

---
Co-authored-by: Claude Code <noreply@anthropic.com>
- Inline comment inside build_trainer() was a near-verbatim repeat of
  the module-level block; replaced with a brief cross-reference

---
Co-authored-by: Claude Code <noreply@anthropic.com>
…ect_device fallback

- test_train_auto_batch_ensures_model_on_device_before_resolve: verifies
  device placement happens before auto-batch probing (detr.py:512-516)
- test_detect_device_falls_back_when_torch_accelerator_absent: simulates
  PyTorch < 2.4 with no torch.accelerator module
- test_detect_device_falls_back_when_current_accelerator_raises: covers
  RuntimeError catch path
- test_detect_device_returns_cpu_when_no_gpu: covers CPU-only fallback

---
Co-authored-by: Claude Code <noreply@anthropic.com>
---
Co-authored-by: Claude Code <noreply@anthropic.com>
Borda and others added 4 commits April 8, 2026 19:32
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
- TestDetectDevice: use @patch decorator + MagicMock(spec=[]) to simulate
  missing current_accelerator without PropertyMock or class mutation
- test_train_auto_batch_ensures_model_on_device_before_resolve: convert
  to @patch decorators, drop unused tmp_path, remove spurious
  rfdetr.detr.resolve_auto_batch_config patch (local import means only
  rfdetr.training.auto_batch is the correct target), explicit side_effect
  functions replacing fragile `lambda ... or` pattern

---
Co-authored-by: Claude Code <noreply@anthropic.com>
…lly to @patch decorators

- Remove inline 'import unittest.mock as mock' from test body
- Add module-level 'from unittest.mock import MagicMock, patch'
- Three context-manager patches → three @patch decorators
- mock_trainer.side_effect replaces nested _fake_trainer closure

---
Co-authored-by: Claude Code <noreply@anthropic.com>
Borda
Borda previously approved these changes Apr 8, 2026
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Borda added 2 commits April 8, 2026 21:38
- guard private PTL launcher import with clear runtime error path
- respect explicit CPU accelerator when gating compile/pin_memory
- fix optimize_for_inference CUDA-context tests on CPU builds
- add focused regression tests for launcher compatibility and accelerator overrides

Co-authored-by: OpenAI Codex <codex@openai.com>
@Borda Borda merged commit a6a080e into roboflow:develop Apr 8, 2026
25 checks passed

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

model.train(strategy="ddp_notebook") fails with "Cannot re-initialize CUDA in forked subprocess"

3 participants